A general framework for efficient clustering of large datasets based on activity detection

Authors

  • Xin Jin
  • Sangkyum Kim
  • Jiawei Han
  • Liangliang Cao
  • Zhijun Yin
Abstract

Data clustering is one of the most popular data mining techniques, with broad applications. K-Means is one of the most popular clustering algorithms because of its efficiency, its effectiveness, and its wide implementation in commercial and non-commercial software. Efficient clustering of large datasets is especially useful; however, K-Means clustering on large data suffers from a heavy computational burden, which originates from the numerous distance calculations between the patterns and the centers. In this paper, we propose GAD (General Activity Detection), a framework for fast clustering on large-scale data based on center activity detection. Within this framework we design a set of algorithms for different scenarios: (1) an exact GAD algorithm, E-GAD, which is much faster than K-Means and produces the same clustering result; (2) approximate GAD algorithms with different assumptions, which are faster than E-GAD while achieving different degrees of approximation; and (3) GAD-based algorithms to handle the large clusters problem, which appears in many large-scale clustering applications. The framework provides a general solution for exploiting activity detection for fast clustering in both exact and approximate scenarios, and our proposed algorithms within the framework can achieve very high speed. We have conducted extensive experiments on several datasets from various real-world applications, including data compression, image clustering, and bioinformatics. Measured by clustering quality and CPU time, the experimental results show the effectiveness and high efficiency of our proposed algorithms.
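The abstract does not spell out how E-GAD works, so the following is only a minimal sketch of one way center activity can be used to skip distance calculations without changing the K-Means result. It assumes that a center is "active" if its coordinates changed in the last update and "static" otherwise: a point whose own center is static keeps a valid cached distance and only needs to be compared against the active centers. The names `kmeans`, `nearest`, `update_centers`, and `use_activity` are hypothetical and are not taken from the paper.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers, candidates, best_j=-1, best_d=math.inf):
    """Closest center among `candidates`, starting from an incumbent (best_j, best_d).
    Ties go to the lower index so that both modes of `kmeans` agree exactly."""
    for j in candidates:
        d = sq_dist(point, centers[j])
        if d < best_d or (d == best_d and j < best_j):
            best_j, best_d = j, d
    return best_j, best_d


def update_centers(points, labels, old_centers):
    """Mean of each cluster; an empty cluster keeps its old center."""
    k, dim = len(old_centers), len(old_centers[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, j in zip(points, labels):
        counts[j] += 1
        for t in range(dim):
            sums[j][t] += p[t]
    return [[s / counts[j] for s in sums[j]] if counts[j] else list(old_centers[j])
            for j in range(k)]


def kmeans(points, init_centers, use_activity=True, max_iter=100):
    """Lloyd-style K-Means. Returns (labels, centers, distance computations)."""
    centers = [list(c) for c in init_centers]
    k = len(centers)
    labels = [-1] * len(points)
    dists = [math.inf] * len(points)
    active = list(range(k))  # assumption: every center is active before pass 1
    n_dist = 0
    for _ in range(max_iter):
        for i, p in enumerate(points):
            own = labels[i]
            if use_activity and own >= 0 and own not in active:
                # Own center is static: the cached distance is still valid,
                # and only centers that moved can have become closer.
                labels[i], dists[i] = nearest(p, centers, active, own, dists[i])
                n_dist += len(active)
            else:
                # Own center moved (or first pass): scan all centers.
                labels[i], dists[i] = nearest(p, centers, range(k))
                n_dist += k
        new_centers = update_centers(points, labels, centers)
        active = [j for j in range(k) if new_centers[j] != centers[j]]
        centers = new_centers
        if not active:  # no center moved, so the assignment is stable
            break
    return labels, centers, n_dist


if __name__ == "__main__":
    rng = random.Random(0)
    blobs = [(0, 0), (6, 6), (0, 6), (6, 0)]
    data = [[rng.gauss(cx, 0.7), rng.gauss(cy, 0.7)]
            for cx, cy in blobs for _ in range(100)]
    init = rng.sample(data, 8)
    _, _, full = kmeans(data, init, use_activity=False)
    _, _, fast = kmeans(data, init, use_activity=True)
    print("distance computations:", full, "(full scan) vs", fast, "(activity-aware)")
```

Running both modes from the same initial centers should give identical labels and centers, because a static center's distance to every point is unchanged between passes. How many distance computations are saved depends on the data and on how many centers stop moving before convergence.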


Similar resources

A Hybrid Framework for Building an Efficient Incremental Intrusion Detection System

In this paper, a boosting-based incremental hybrid intrusion detection system is introduced. This system combines incremental misuse detection and incremental anomaly detection. We use a boosting ensemble of weak classifiers to implement the misuse intrusion detection system. It can identify new types of intrusions that do not exist in the training dataset for incremental misuse detection. As...


GAD: General Activity Detection for Fast Clustering on Large Data

In this paper, we propose GAD (General Activity Detection) for fast clustering on large scale data. Within this framework we design a set of algorithms for different scenarios: (1) Exact GAD algorithm E-GAD, which is much faster than K-Means and gets the same clustering result. (2) Approximate GAD algorithms with different assumptions, which are faster than E-GAD while achieving different degre...


A Hybrid Time Series Clustering Method Based on Fuzzy C-Means Algorithm: An Agreement Based Clustering Approach

In recent years, the advancement of information gathering technologies such as GPS and GSM networks has led to huge, complex datasets such as time series and trajectories. As a result, it is essential to use appropriate methods to analyze the large raw datasets produced. Extracting useful information from large datasets has always been one of the most important challenges in different sciences,...


A Pre-Trained Ensemble Model for Breast Cancer Grade Detection Based on Small Datasets

Background and Purpose: Breast cancer is reported as one of the most common cancers among women. Early detection of the cancer type is essential for informing subsequent treatment. The newest proposed breast cancer detectors are based on deep learning. Most of these works focus on large datasets and are not developed for small datasets. Although the large datasets might lead ...


Assessment of the Performance of Clustering Algorithms in the Extraction of Similar Trajectories

In recent years, the rapid growth of spatial trajectory data and the need to process it and extract useful information and meaningful patterns have attracted many researchers to the field of spatio-temporal trajectory clustering. The processing and analysis of these trajectories have resulted in the extraction of useful information whic...


Journal:
  • Statistical Analysis and Data Mining

Volume 4

Publication date: 2011